PsL Monthly 1994 January


Text File | 1990-05-03 | 5.0 KB | 82 lines

VOICE DIGITIZATION AND REPRODUCTION
ON THE IBM PC/XT AND PC/AT BUILT-IN SPEAKER
--------------------------------------------
Alan D. Jones
July 1988

The speaker on the PC and its associated driver circuitry are quite
simple and crude, having been designed primarily for creating single
square-wave tones of various audio frequencies. The speaker is
typically driven by a pair of transistors used as a current amplifier,
which is in turn driven directly by the output of a TTL gate. This
results in only two possible voltages across the voice coil: 0 volts
and 5 volts. Any sound to be reproduced by this system must be reduced
to an approximation in the form of a stream of constant-amplitude,
variable-width rectangular pulses.

Examination of a speech waveform on an oscilloscope display quickly
tells us that it is not going to be possible to even remotely mimic
this waveform under the above restrictions. Much of the information
contained in the waveform is in the form of amplitude variations, and
this is the one attribute we cannot reproduce.

It is initially tempting to try to use the technique of the "class D"
amplifier to create the waveform, using high-speed pulse width
modulation and depending on the mechanical characteristics of the
speaker and those of the human ear to provide the missing low-pass
filtering. Assuming the sampling rate to be 8 kHz (based on the
Nyquist criterion) and, to conserve memory, assuming the samples to
contain only 4 bits of amplitude information (16 levels), we can see
that data accumulates at a rate of about 4K bytes per second, which is
certainly acceptable. The problem comes when we try to play back the
sound. Pulses occur at intervals of 125 microseconds, which doesn't
seem too bad, but since each pulse can have 16 possible widths, it is
necessary to time the pulses with a resolution of under 8 microseconds.
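The arithmetic behind the pulse-width-modulation figures above can be
checked with a short sketch. This is only an illustration, not part of
the original program; the function name pwm_budget and the dictionary
keys are made up for the example, and the CPU clock is taken as the
nominal 4.77 MHz of the PC/XT.

```python
# Timing budget for the pulse-width-modulation ("class D") approach.
# pwm_budget and its keys are hypothetical names used for illustration.

def pwm_budget(sample_rate_hz, bits_per_sample, cpu_clock_hz):
    """Return the data rate and timing resolution PWM playback would need."""
    levels = 2 ** bits_per_sample                      # distinct pulse widths
    bytes_per_second = sample_rate_hz * bits_per_sample / 8
    pulse_interval_us = 1e6 / sample_rate_hz           # time between pulses
    resolution_us = pulse_interval_us / levels         # one width step
    # CPU clock cycles available per width step.
    cpu_clocks_per_step = cpu_clock_hz / (sample_rate_hz * levels)
    return {
        "levels": levels,
        "bytes_per_second": bytes_per_second,
        "pulse_interval_us": pulse_interval_us,
        "resolution_us": resolution_us,
        "cpu_clocks_per_step": cpu_clocks_per_step,
    }


if __name__ == "__main__":
    # 8 kHz sampling, 4 bits per sample, nominal 4.77 MHz clock.
    for name, value in pwm_budget(8_000, 4, 4_770_000).items():
        print(f"{name}: {value}")
```

With the figures from the text this gives 4000 bytes per second, a
125 microsecond pulse interval, a width step of 7.8125 microseconds,
and roughly 37 CPU clock cycles per step on the XT.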
An interval that short is only a couple of instruction times on a
4.77 MHz XT, and even on a fast 80386 it doesn't give the CPU much time
between pulse edges to shift bits, read and increment a pointer, check
the pointer to see if it's done yet, etc., not to mention the
difficulty of servicing unrelated interrupts.

The search for simpler (but still usable) and less CPU-intensive
methods of reproducing speech leads to the question of what
information in the waveform we can discard without an unacceptable
loss of intelligibility. My experiments with running speech signals
through a graphic equalizer revealed that the lower-frequency
components, those which are most visible to the eye on the
oscilloscope, are actually of minimal importance in understanding
speech. This is also demonstrated by the fact that a whisper is just
as understandable as normal speech, but does not make use of vibrating
vocal cords, which are the primary source of low-frequency components
in the voice.

The digitizer circuit consists of two stages of voltage amplification
with some high-pass filtering built into the coupling capacitors,
followed by a differentiator. The output of the differentiator is fed
to a voltage comparator, thus producing an output which has
approximately the following relationship to the input from the
microphone:

    If the derivative of the speech waveform is positive,
    then the output is logic zero;

    If the derivative of the speech waveform is negative,
    then the output is logic one.

The transition timing at the output is entirely analog in nature;
there is no synchronizing clock signal anywhere in the circuit. If the
output of this circuit is connected directly to a speaker, the
resulting sound will still be an understandable version of the input.
Since the output consists of nothing but a digital bit stream, the job
of the computer becomes that of simply recording and accurately
reproducing this bit stream.

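The differentiator-plus-comparator rule above can be modelled in
software. The following is a minimal sketch, not the circuit or the
author's code: it works on an already-sampled waveform, whereas the
real circuit is purely analog. The function names, the choice to hold
the previous bit when the slope is exactly zero, and the
most-significant-bit-first packing order are all assumptions made for
the example.

```python
# Software model of the digitizer: one bit per sample step, taken from
# the sign of the waveform's slope. All names here are hypothetical.

def derivative_sign_bits(samples):
    """Return 0 while the waveform rises and 1 while it falls."""
    bits = []
    current = 0  # assumption: hold the last output when the slope is zero
    for earlier, later in zip(samples, samples[1:]):
        if later > earlier:
            current = 0      # positive derivative -> logic zero
        elif later < earlier:
            current = 1      # negative derivative -> logic one
        bits.append(current)
    return bits


def pack_bits(bits):
    """Pack bits into bytes, MSB first (an arbitrary choice), zero-padded."""
    packed = bytearray()
    for start in range(0, len(bits), 8):
        chunk = bits[start:start + 8]
        chunk = chunk + [0] * (8 - len(chunk))
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)


if __name__ == "__main__":
    waveform = [0, 1, 2, 1, 0, 0, 1]
    bits = derivative_sign_bits(waveform)
    print(bits)                    # [0, 0, 1, 1, 1, 0]
    print(pack_bits(bits).hex())   # 38
```

Note that all amplitude information is discarded: only the times at
which the waveform changes direction survive, which is the point of
the technique.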
The program operates by reprogramming the 8253 timer chip to produce
hardware interrupts at a 16.5 kHz rate. The interrupt service routine
then manipulates the NAND gate driving the speaker based on bits read
from the file. The 16.5 kHz rate was chosen by trial and error; this
is the audible "point of diminishing returns", where a further
increase in sampling rate didn't produce enough of an improvement to
warrant the increased memory usage.

This technique is somewhat limited in its usefulness. It necessitates
the writing of a "badly behaved" program which not only reprograms the
timer chip but also totally hogs the CPU for the duration of the voice
output. Nevertheless, it demonstrates a few interesting things about
how humans hear speech.

I first developed this circuit over a year ago as a rebuttal to
someone who said "it couldn't be done". Not only can it be done, it is
actually quite simple. Certainly the circuit could be improved, at the
possible expense of increased complexity. I'm waiting to hear from
some of you. If anyone has questions, especially about my sloppy code,
I check for messages on CIS every three or four days.

- Alan
  74030,554
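For readers who want to check the playback numbers, the timer
arithmetic behind the 16.5 kHz interrupt rate can be sketched as
follows. This is an illustration under stated assumptions, not the
author's code: the 8253 in the PC is clocked at about 1.193182 MHz and
divides that by a 16-bit count, but the exact count the original
program loads is not given in the text, so the rounded value computed
here is a guess. The function names are hypothetical.

```python
import math

# Input clock of the PC's 8253 timer, about 1.19318 MHz.
PIT_INPUT_HZ = 1_193_182


def timer_divisor(target_hz, pit_input_hz=PIT_INPUT_HZ):
    """Nearest 16-bit count giving roughly target_hz interrupts per second."""
    divisor = round(pit_input_hz / target_hz)
    if not 2 <= divisor <= 65535:
        raise ValueError("rate is outside the range of a 16-bit count")
    return divisor


def playback_budget(target_hz, seconds):
    """Return (divisor, actual rate in Hz, bytes needed at one bit per tick)."""
    divisor = timer_divisor(target_hz)
    actual_hz = PIT_INPUT_HZ / divisor
    bytes_needed = math.ceil(actual_hz * seconds / 8)
    return divisor, actual_hz, bytes_needed


if __name__ == "__main__":
    divisor, actual_hz, bytes_needed = playback_budget(16_500, 1)
    print(divisor, round(actual_hz, 1), bytes_needed)
```

Under these assumptions the count comes out as 72, the actual rate as
about 16,572 Hz, and one second of speech as about 2072 bytes, roughly
half the 4K bytes per second of the 4-bit scheme considered earlier.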